Introduction to the tm Package: Text Mining in R

Author

  • Ingo Feinerer

Abstract

This vignette gives a short introduction to text mining in R using the text mining framework provided by the tm package. We present methods for data import, corpus handling, preprocessing, metadata management, and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining in R. An in-depth description of the text mining infrastructure offered by tm was published in the Journal of Statistical Software (Feinerer et al., 2008), and an introductory article on text mining in R was published in R News (Feinerer, 2008).

Similar resources

Text Mining Infrastructure in R

During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis metho...

Data Mining in R using Rattle

This paper is a brief introduction to the concepts, methods and algorithms for data mining in statistical software R using a package named Rattle. Rattle provides a good graphical environment to perform some of the procedures and algorithms without the need for programming. Some parts of the package will be explained by a number of examples. ...

Topic Models in R

Topic models are a popular method for modeling the term frequency occurrences in documents. The fitted model allows to better estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables which are referred to as topics. The R package topicmodels provides basic infrastructure for fitting topic models based on data structur...

Nonparametric Distribution Analysis for Text Mining

A number of new algorithms for nonparametric distribution analysis based on Maximum Mean Discrepancy measures have been recently introduced. These novel algorithms operate in Hilbert space and can be used for nonparametric two-sample tests. Coupled with recent advances in string kernels, these methods extend the scope of kernel-based methods in the area of text mining. We review these kernel-ba...

topicmodels: An R Package for Fitting Topic Models

This article is a (slightly) modified and shortened version of Grün and Hornik (2011), published in the Journal of Statistical Software. Topic models allow the probabilistic modeling of term frequency occurrences in documents. The fitted model can be used to estimate the similarity between documents as well as between a set of specified keywords using an additional layer of latent variables whi...

Publication date: 2010